Abstract

In this paper we propose a non-metric, ranking-based representation of semantic similarity that allows natural aggregation of semantic information from multiple heterogeneous sources. We apply the ranking-based representation to zero-shot learning problems, and present deterministic and probabilistic zero-shot classifiers which can be built from pre-trained classifiers without retraining. We demonstrate their advantages on two large real-world image datasets. In particular, we show that aggregating different sources of semantic information, including crowd-sourcing, leads to more accurate classification.

1. Introduction 

In standard multiclass classification settings, classes are treated as a categorical set without any extra structure. When we have side-information on the structure of classes, such as semantic relatedness, we can use this information to improve the classification itself, or to transfer knowledge learned in the training domain to solve problems in a new domain.

Consider a classification problem with the following 10 visual objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. There are many sources from which semantic information for those objects can be obtained. WordNet is a knowledge base of semantic hierarchies developed manually by linguistic experts (Miller, 1995). In WordNet, objects form a hierarchical tree (Figure 1, Left), where a child object is 'a kind of' its parent object. Several similarity metrics can be defined from the hierarchy¹, one of which is shown in Figure 1 (Middle) as a two-dimensional classical multidimensional scaling (MDS) embedding. Semantic relatedness can also be mined automatically from existing corpora, such as Wikipedia or the Google N-Gram corpus, or by using search engines, where the cosine of the angle between co-occurrence vectors can be used as a similarity of two words. More recently, elaborate methods for learning vectorial representations of words have also been proposed (Huang et al., 2012; Mikolov et al., 2013; Pennington et al., 2014). Figure 1 (Right) is an example MDS embedding obtained from the representation of (Huang et al., 2012). As can be seen from the figure, the similarity of the same objects can look very different depending on which semantic source and measure are used.

¹ http://maraca.d.umn.edu/similarity/measures.html

Non-metric representation of similarity. Multiple sources of semantic information have the potential to complement each other for an improved classification result. Still, how best to aggregate similarity from inhomogeneous sources remains an open problem. Similarity measures from different corpora or methods are not directly comparable, and therefore a simple averaging of the measures will not be optimal. The first key idea of our paper is to use a non-metric, ranking-based representation of semantic similarity instead of a numerical representation.

To illustrate our approach, consider the problem of distinguishing cat and truck. In Figure 1 (Middle), cat is closer to dog than to automobile:

$$d(\text{cat}, \text{dog}) < d(\text{cat}, \text{automobile}),$$

and truck is closer to automobile than to dog:

$$d(\text{truck}, \text{automobile}) < d(\text{truck}, \text{dog}).$$

In other words, we may be able to distinguish cat and truck from their closeness to other reference objects without using any numerical similarity. As a special case, we can use the similarity of all the other objects to cat to form a semantic ranking of cat. For example, cat has a semantic ranking

$$\pi_{\text{cat}} = [\text{horse}, \text{deer}, \cdots, \text{automobile}, \text{airplane}], \tag{1}$$

and truck has a semantic ranking

$$\pi_{\text{truck}} = [\text{automobile}, \text{ship}, \cdots, \text{deer}, \text{horse}], \tag{2}$$

according to the distance in Figure 1 (Middle). Not only may the ordinal similarity be sufficient for distinguishing cat and truck, but it also seems a more natural representation, since the ordinal similarity is invariant under scaling and monotonic transformation of numerical values and therefore has a better chance of being consistent across heterogeneous sources. Moreover, ordinal information can be obtained directly from non-numerical comparisons. In particular, when we ask human subjects to judge the similarity of objects, it is easier for them to rank objects than to assign numerical scores of similarity.

Figure 1. Representations of semantic relatedness of 10 objects. Left: hierarchical tree from WordNet. Middle: MDS embedding using the Wu & Palmer metric. Right: MDS embedding from (Huang et al., 2012). Note that the two embeddings from different similarity measures look very different (Middle vs Right). Some labels are hidden to avoid clutter.
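As a small illustration of the invariance claim, the following Python sketch sorts reference classes by similarity and checks that a monotonic rescaling leaves the resulting ranking unchanged. The similarity values and the helper name `semantic_ranking` are hypothetical and are not taken from the paper.

```python
import math


def semantic_ranking(similarity):
    """Return reference classes ordered from most to least similar."""
    return sorted(similarity, key=similarity.get, reverse=True)


# Hypothetical similarity scores of `cat` to four reference classes.
sim_cat = {"dog": 0.9, "horse": 0.7, "automobile": 0.2, "airplane": 0.1}

# A strictly increasing rescaling, standing in for a second source
# that reports similarity on a different numerical scale.
rescaled = {name: math.exp(5.0 * value) for name, value in sim_cat.items()}

ranking_a = semantic_ranking(sim_cat)
ranking_b = semantic_ranking(rescaled)
```

The two sources disagree on every numerical value, yet `ranking_a` and `ranking_b` are identical, which is the property that makes rankings easier to aggregate than raw scores.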

Zero-shot classification without retraining. In this paper, we apply non-metric ranking-based representations of semantic similarity to zero-shot classification problems (Palatucci et al., 2009; Lampert et al., 2009; Rohrbach et al., 2010; 2011; Qi et al., 2011; Mensink et al., 2012; Frome et al., 2013; Socher et al., 2013). In zero-shot learning we have samples $\{(x_i, y_i)\}$ from the training domain $\mathcal{X} \times \mathcal{Y}$ (e.g., $\mathcal{Y}$ is the set of 8 objects), but no samples from the test domain $\mathcal{X} \times \mathcal{Z}$ (e.g., $\mathcal{Z} = \{\text{cat}, \text{truck}\}$). The goal is to construct a classifier $\mathcal{X} \to \mathcal{Z}$ using only the training data $\{(x_i, y_i)\}$ and semantic knowledge of the two domains $\mathcal{Y}$ and $\mathcal{Z}$.

A standard approach to classifying C classes is to use binary classifications in a one-vs-rest or one-vs-one setting, or to use multiclass losses directly. If we already have pre-trained classifiers of the training-domain classes $\mathcal{Y}$ using one of those settings, can we use those classifiers 'for free' to distinguish the unseen classes cat and truck without retraining with training-domain samples? Figure 2 provides an intuition on the problem. Consider multiple decision hyperplanes learned in the one-vs-one setting (the other settings are discussed in Section 2). The hyperplanes partition the feature space into 'cells', each of which assigns a ranking of C objects to the points in its interior. To see this, note that all pairs of objects are compared in each cell (either $i \prec j$ or $j \prec i$), and transitivity (see Section 2) follows from the metric triangle inequality. The ranking of an unseen test sample assigned by the pre-trained classifiers can be compared with the semantic rankings of cat and truck for zero-shot classification, assuming feature and semantic similarities are strongly correlated (see (Deselaers & Ferrari, 2011) for a discussion).


Figure 2. Decision hyperplanes for classifying 8 objects partition the feature space into cells that correspond to rankings (see text). The lines are separating hyperplanes from the one-vs-one binary classification setting.

Building on this idea, we present novel zero-shot classification methods that are free of retraining and can aggregate semantic information from multiple sources. We start by proposing a simple deterministic ranking-based method, and further improve the method by introducing probability models of rankings. In the probabilistic approach, real-valued classification scores are mapped to posterior probabilities of rankings, and combined with prior probabilities of rankings learned from (multiple) semantic sources. The advantages of the probabilistic approach are explained further in the method and experiment sections. For both the posterior and the prior probabilities of rankings, we use classic probabilistic models of ranking, including the Plackett-Luce, the Mallows, and the Babington-Smith models.

To summarize the contributions of this paper, we present

1. a non-metric ranking-based representation of semantic structure, as an alternative to numerical similarity representations

2. methods of aggregating multiple semantic sources using probability models of rankings

3. deterministic and probabilistic zero-shot classifiers built from pre-trained classifiers without retraining.

In the experiment section we demonstrate the advantages of our approach over a numerically-based approach and a deterministic approach using two well-known image databases, Animals with Attributes (Lampert et al., 2009) and CIFAR-10/100 (Krizhevsky, 2009). In particular, we demonstrate that aggregating different semantic sources, including crowd-sourcing, leads to more accurate zero-shot classification.

The remainder of the paper is organized as follows. In Section 2, we present deterministic and probabilistic ranking-based algorithms for zero-shot classification. In Section 3, we relate our work to others in the literature. In Section 4, we test our methods on real-world image databases, and we conclude the paper in Section 5.

2. Zero-shot learning with rankings 

Notations. Let $\mathcal{R}$ denote the set of all rankings on $C$ items/classes, and let $\pi = [\pi(1), \dots, \pi(C)] \in \mathcal{R}$ denote a ranking: $\pi(i)$ is the position of item $i$ and $\pi^{-1}(j)$ is the item number whose position is $j$. We write $i \prec j$ ('$i$ precedes $j$') when $\pi(i) < \pi(j)$ ('item $i$ is ranked higher than item $j$'). A top-K ranking is a straightforward generalization of a ranking, in which only the order of the first $K$ items $\pi^{-1}(1), \dots, \pi^{-1}(K)$ matters and the order of the remaining $C - K$ items is ignored. With an abuse of notation, we also use $\pi$ and $\mathcal{R}$ for a top-K ranking and the set of all top-K rankings, since a full ranking is the special case $K = C$. A partial order is a further generalization of a ranking and a top-K ranking. In a (full) ranking, a pair of items $(i, j)$, $i \neq j$, has to satisfy either $i \prec j$ or $j \prec i$, whereas it can satisfy neither in a partial order. In addition, a partial order has to satisfy transitivity: for any triple $(i, j, k)$, $i \prec j$ and $j \prec k$ implies $i \prec k$. Item positions $\pi(\cdot)$ are in general undefined for a partial order.

2.1. Deterministic approach 

A simple deterministic approach to zero-shot learning using semantic rankings was already outlined in the Introduction. In the one-vs-one setting, the pre-trained classifiers assign a ranking $\pi(x)$. In the one-vs-rest setting, C binary classifiers assign C real-valued scores to a test point according to the point's distances to the C decision hyperplanes. The scores can be sorted to provide a ranking $\pi(x)$. Given this ranking $\pi(x)$ of a test sample $x$, and prior knowledge of the semantic rankings $\{\pi_z \mid z \in \mathcal{Z}\}$ of the test-domain classes $\mathcal{Z}$, we predict

$$z(x) = \arg\min_{z \in \mathcal{Z}} d(\pi(x), \pi_z), \tag{3}$$

where $d(\cdot, \cdot)$ is a distance between two rankings. For example, let $\mathcal{Z} = \{\text{cat}, \text{truck}\}$, whose semantic rankings are (1) and (2), respectively. If an unseen image $x$ has classification scores in the order $\pi(x) = [\text{dog}, \text{deer}, \cdots, \text{ship}, \text{airplane}]$, so that $d(\pi(x), \pi_{\text{cat}}) < d(\pi(x), \pi_{\text{truck}})$ for some $d(\cdot, \cdot)$, then we classify $x$ as a cat rather than a truck. We use Kendall's ranking distance, which is the number of mismatching orders:

$$d(\pi_1, \pi_2) = |\{(i, j) \mid \pi_1(i) > \pi_1(j) \wedge \pi_2(i) < \pi_2(j)\}|. \tag{4}$$
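The rule (3) with the distance (4) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: rankings are represented as ordered lists of class names (that is, as $\pi^{-1}$), and the four-class rankings below are hypothetical stand-ins for the elided rankings (1) and (2).

```python
from itertools import combinations


def kendall_distance(order1, order2):
    """Count the item pairs that two full rankings order differently (eq. 4)."""
    pos1 = {item: rank for rank, item in enumerate(order1, start=1)}
    pos2 = {item: rank for rank, item in enumerate(order2, start=1)}
    return sum(
        1
        for a, b in combinations(order1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def classify_dr(predicted_order, semantic_orders):
    """Rule (3): return the class whose semantic ranking is nearest."""
    return min(
        semantic_orders,
        key=lambda z: kendall_distance(predicted_order, semantic_orders[z]),
    )


# Hypothetical semantic rankings over four training-domain classes.
semantic = {
    "cat": ["dog", "horse", "automobile", "airplane"],
    "truck": ["automobile", "airplane", "horse", "dog"],
}
# Hypothetical ranking produced by sorting classifier scores for a test image.
predicted = ["dog", "automobile", "horse", "airplane"]
label = classify_dr(predicted, semantic)
```

Here the predicted ranking disagrees with the cat ranking on one pair and with the truck ranking on four pairs, so the test image is assigned to cat.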

Sometimes it may make more sense to compare only the closest items than to compare all the items. The top-K version of Kendall's distance was proposed in (Critchlow, 1985), and can be computed as follows. Let $A$, $B$, and $D$ be the sets

$$A = \{i \in 1, \dots, C \mid \pi_1(i) \le K,\ \pi_2(i) \le K\}$$

$$B = \{i \in 1, \dots, C \mid \pi_1(i) \le K,\ \pi_2(i) > K\}$$

$$D = \{i \in 1, \dots, C \mid \pi_1(i) > K,\ \pi_2(i) \le K\}.$$

Then Kendall's top-K distance can be computed by

$$d_K(\pi_1, \pi_2) = |\{(i, j) \in A \times A \mid \pi_1(i) < \pi_1(j),\ \pi_2(i) > \pi_2(j)\}| + |B|\left(C + K - \frac{|B| - 1}{2}\right) - \sum_{i \in B} \pi_1(i) - \sum_{i \in D} \pi_2(i). \tag{5}$$

Zero-shot classification using the rule (3) will be called the deterministic ranking-based (DR) method.
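The top-K distance (5) can be sketched as follows. The code assumes one reading of the formula: $A$ holds the items in both top-K lists, $B$ the items only in the first list, and $D$ the items only in the second list, with the middle term taken as $|B|(C + K - (|B| - 1)/2)$. The function name and the list-based representation are illustrative choices, not part of the paper.

```python
def topk_kendall_distance(top1, top2, num_items):
    """Top-K Kendall distance between two ordered top-K lists of equal length.

    top1, top2: the K highest-ranked items, best first.
    num_items:  the total number of items C.
    """
    k = len(top1)
    pos1 = {item: rank for rank, item in enumerate(top1, start=1)}
    pos2 = {item: rank for rank, item in enumerate(top2, start=1)}
    in_both = [i for i in top1 if i in pos2]       # set A
    only_first = [i for i in top1 if i not in pos2]  # set B
    only_second = [i for i in top2 if i not in pos1]  # set D
    # Pairs inside A that the two lists order differently.
    discordant = sum(
        1
        for a in in_both
        for b in in_both
        if pos1[a] < pos1[b] and pos2[a] > pos2[b]
    )
    h = len(only_first)
    return (
        discordant
        + h * (num_items + k - (h - 1) / 2)
        - sum(pos1[i] for i in only_first)
        - sum(pos2[i] for i in only_second)
    )
```

For $K = C$ the sets $B$ and $D$ are empty and the function reduces to the full Kendall distance (4). For small cases such as two disjoint top-2 lists over four items, the value (5 in that case) agrees with a hand computation of the Hausdorff distance between the sets of full rankings consistent with each list.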

2.2. Probabilistic approach 

We can further refine ranking-based algorithms by considering a probabilistic approach. There are several causes of uncertainty in a ranking-based representation. First, classifier outputs for a test-domain sample $x$ can have low confidence, since the classifiers are trained only with training-domain samples. Second, prior knowledge of semantic rankings from multiple semantic sources may not be unanimous. Third, feature and semantic similarities do not always coincide. For these reasons, we consider probability models of (top-K) rankings. We discuss three models: the Mallows (Mallows, 1957), the Plackett-Luce (Plackett, 1975; Luce, 1959; Marden, 1995), and the Babington-Smith (Joe & Verducci, 1993; Smith, 1950) models, which we introduce where they are needed (see (Critchlow, 1985; Critchlow et al., 1991; Marden, 1995) for reviews).

In our probabilistic zero-shot learning approach, we assume the following dependence:

$$x \to \pi \to z, \tag{6}$$

that is, the label $z$ of a sample $x$ is dependent only on the predicted ranking $\pi$, which in turn is dependent only on the sample $x$. The probability of a test-domain label given the sample, $P(z|x)$, is obtained by marginalizing out the latent ranking variable $\pi$:

$$P(z|x) = \sum_{\pi \in \mathcal{R}} P(\pi|x) P(z|\pi) = \sum_{\pi \in \mathcal{R}} P(\pi|x) \frac{P(z) P(\pi|z)}{\sum_{z'} P(z') P(\pi|z')}, \tag{7}$$

and the final prediction of the label $z$ for a test sample is made by $\arg\max_z P(z|x)$.

There are two terms in (7): a probabilistic ranker $P(\pi|x)$ and a prior for semantic ranking $P(\pi|z)$. First, we describe the prior for semantic ranking $P(\pi|z)$ learned from one or more semantic sources (e.g., different corpora or crowd-sourcing) in Section 2.3. Second, we describe probabilistic rankers $P(\pi|x)$ based on standard classifiers trained with training-domain data in Section 2.4. The final zero-shot classifier for unseen samples, which brings these two learned components together, is described in Section 2.5.

2.3. Prior for semantic ranking

We encode the semantic similarity between training- and test-domain classes by probabilistic ranking models of training-domain classes $P(\pi|z)$ for each test-domain class $z$. To learn $P(\pi|z)$, we use multiple instances of rankings for each test-domain class. These rankings can come from multiple linguistic corpora or directly from human-rated rankings. Below we outline three popular models of rankings: the Plackett-Luce, the Mallows, and the Babington-Smith models.

Plackett-Luce. The Plackett-Luce model for the probability of observing a top-K ranking $\pi$ is

$$P(\pi; v) = \prod_{i=1}^{K} \frac{v_{\pi^{-1}(i)}}{\sum_{j=i}^{C} v_{\pi^{-1}(j)}}. \tag{8}$$

The non-negative parameters $v_1, \dots, v_C$ indicate the relative chances of being ranked higher than the rest of the items, and the model is invariant under constant scaling of the $v$'s. One interpretation of the generative procedure of the Plackett-Luce model is the vase interpretation from (Silberberg, 1980). Suppose there is a vase with an infinite number of balls marked 1 to $C$, whose numbers are proportional to the $v$'s. At the first stage, a ball is drawn and recorded as $\pi^{-1}(1)$. At the second stage, another ball is drawn and recorded as $\pi^{-1}(2)$, unless that ball has already been selected ($\pi^{-1}(1)$), in which case the drawing is tried again. The procedure continues until $K$ distinct balls are drawn and recorded. This generative probability is captured by (8).

Given $N$ samples (= semantic sources) of rankings $\pi_1, \dots, \pi_N$ for a class, the parameters can be estimated by MLE. The log-likelihood of (8) is

$$L(v) = \sum_{n=1}^{N} \sum_{i=1}^{K} \left[ \log v_{\pi_n^{-1}(i)} - \log \sum_{j=i}^{C} v_{\pi_n^{-1}(j)} \right], \tag{9}$$

with possibly an additional regularization term $\eta \sum_i v_i^2$. Hunter (Hunter, 2004) proposed an iterative estimation method using the Minorization-Maximization procedure, which generalizes the Expectation-Maximization procedure and converges to a global maximum under a certain condition on the data. In our experience, simple gradient-based or Newton-Raphson methods seem to work well with an appropriate choice of the regularization parameter.

Mallows. The Mallows model (Mallows, 1957) for full rankings is defined as $P(\pi; \pi_0, \lambda) \propto e^{-\lambda d(\pi, \pi_0)}$, where $\pi_0$ is the mode, $\lambda$ is the spread parameter, and $d(\cdot, \cdot)$ is Kendall's distance between two rankings. It may be viewed as a discrete analog of the Gaussian distribution for rankings. The distance can further be written as $d(\pi, \pi_0) = d(\pi \pi_0^{-1}, e) = \sum_{j=1}^{C-1} V_j(\pi \pi_0^{-1})$, where $e$ is the identity ranking and the $V_j$'s are defined as

$$V_j(\sigma) = \sum_{i > j} I[\sigma^{-1}(j) > \sigma^{-1}(i)]. \tag{10}$$

Fligner and Verducci (Fligner & Verducci, 1986) proposed the Mallows model for top-K lists by marginalizing the Mallows model:

$$P(\pi; \pi_0, \lambda) = \frac{1}{\phi(\lambda)} e^{-\lambda \sum_{j=1}^{K} V_j(\pi \pi_0^{-1})}, \tag{11}$$

where the $V_j$'s are defined in (10) and $\phi(\lambda)$ is the normalization constant, which can be computed in closed form: $\phi(\lambda) = \prod_{j=1}^{K} (1 - e^{-(C-j+1)\lambda}) / (1 - e^{-\lambda})$.


Given $N$ samples of rankings $\pi_1, \dots, \pi_N$, the parameters of the Mallows model for total rankings can also be found by MLE (Feigin & Cohen, 1978). When the mode $\pi_0$ is known, the MLE of the spread parameter $\lambda$ can be found by convex optimization, owing to the fact that the log-likelihood is a concave function of $\lambda$. The MLE of the centroid $\pi_0$ is the maximizer of $\sum_i \log P(\pi_i; \pi_0, \lambda)$ and is equivalent to

$$\arg\min_{\pi_0} \sum_i d(\pi_i, \pi_0). \tag{12}$$

The minimization (12) is also known as the Kemeny optimal consensus or aggregation problem (Kemeny, 1959) and is known to be NP-hard (Bartholdi et al., 1989). However, there are known heuristic methods such as sequential transposition of adjacent items (Critchlow, 1985) or other admissible heuristics (Meila et al., 2007). We use the former method. Starting from the average ranking as the initial value of $\pi_0$, we search for the adjacent items $\pi_0^{-1}(j)$ and $\pi_0^{-1}(j+1)$ whose swapping lowers the sum of distances the most. We stop if there is no such pair or if the maximum number of iterations (1000 in our case) is exceeded. The MLE with (11) can be solved by using the corresponding top-K distance in place of $d$.
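The sequential-transposition heuristic for (12) can be sketched as below. This is an illustrative reading of the description above, for full rankings only; ties in the initial average ranking are broken by the order of the first sample, which is an assumption of this sketch.

```python
from itertools import combinations


def kendall_distance(order1, order2):
    """Count the item pairs that two full rankings order differently."""
    pos1 = {item: rank for rank, item in enumerate(order1)}
    pos2 = {item: rank for rank, item in enumerate(order2)}
    return sum(
        1
        for a, b in combinations(order1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def total_distance(candidate, rankings):
    """Objective of (12): sum of distances from the candidate to all samples."""
    return sum(kendall_distance(candidate, r) for r in rankings)


def consensus_ranking(rankings, max_iter=1000):
    """Greedy adjacent-swap search for a ranking with small total distance."""
    items = list(rankings[0])
    mean_pos = {it: sum(r.index(it) for r in rankings) / len(rankings) for it in items}
    current = sorted(items, key=mean_pos.get)  # start from the average ranking
    current_cost = total_distance(current, rankings)
    for _ in range(max_iter):
        best, best_cost = None, current_cost
        for j in range(len(current) - 1):
            candidate = current[:]
            candidate[j], candidate[j + 1] = candidate[j + 1], candidate[j]
            cost = total_distance(candidate, rankings)
            if cost < best_cost:
                best, best_cost = candidate, cost
        if best is None:  # no adjacent swap lowers the objective
            break
        current, current_cost = best, best_cost
    return current, current_cost
```

Because the search only accepts strictly improving swaps, it terminates at a local minimum of (12); it does not guarantee the Kemeny optimum, which is consistent with the NP-hardness noted above.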

Babington-Smith. The Babington-Smith model (Joe & Verducci, 1993; Smith, 1950) is another probabilistic ranking model, based on pairwise comparisons. Given two items $i$ and $j$, let $a_{ij}$ be the probability that item $i$ is ranked higher than item $j$. Given these preferences $\{a_{ij}\}$, the probability of a ranking $\pi$ is $P(\pi; a) \propto \prod_{i<j} a_{\pi^{-1}(i)\pi^{-1}(j)}$. After introducing new parameters $v_{ij} = a_{ij}/a_{ji}$ (Joe & Verducci, 1993), the probability of a top-K ranking can be written as²

$$P(\pi; v) = \frac{1}{\psi(v)} \prod_{i=1}^{K} \prod_{j=i+1}^{C} v_{\pi^{-1}(i)\pi^{-1}(j)}. \tag{13}$$

The Babington-Smith model is similar to the Plackett-Luce model in that the probability is a product of $v$'s. The larger $v_{ij}$ is, the more likely it is that item $i$ is ranked higher than item $j$. However, unlike the Plackett-Luce model, the normalizing constant $\psi(v)$ does not have a known closed form. We do not use it for modeling the semantic prior, but use it for the probabilistic ranker in the next section.


2.4. Probabilistic ranker from classifiers 

The probabilistic ranker $P(\pi|x)$ takes a sample $x$ as input and probabilistically ranks the similarity of $x$ to the training-domain classes $\mathcal{Y}$. We propose to build rankers from standard settings of multiclass classifiers: one-vs-rest, one-vs-one, or multiclass-loss as in (Crammer & Singer, 2002). Any classifier that outputs a real-valued confidence or score can be used for this purpose.

One-vs-rest binary classifiers. In this setting, there are $C$ scores $f_1(x), \dots, f_C(x)$, one for each training-domain class. We relate the real-valued scores $\{f_i\}$ to the non-negative parameters $\{v_i\}$ of the Plackett-Luce model (8) by setting $v_i = e^{f_i(x)}$ to get

$$P(\pi|x) = \prod_{i=1}^{K} \frac{e^{f_{\pi^{-1}(i)}(x)}}{\sum_{j=i}^{C} e^{f_{\pi^{-1}(j)}(x)}}. \tag{14}$$

Instead of producing a single ranking $\pi$ as in the deterministic approach (3), this ranker evaluates the probability of any ranking given $x$, taking into account the confidence ($\{f_i(x)\}$) of the $C$ one-vs-rest classifiers.
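The ranker (14) can be evaluated directly from the classifier scores. The sketch below is an illustration under the assumption that classes are indexed 0 to C-1 and that a top-K ranking is given as the ordered list of its K best class indices; the function name is hypothetical.

```python
import math


def plackett_luce_log_prob(top_k, scores):
    """Return log P(pi | x) under (14).

    top_k:  ordered list of the K highest-ranked class indices, best first.
    scores: scores[c] is the real-valued classifier score f_c(x) of class c.
    """
    remaining = list(range(len(scores)))
    log_p = 0.0
    for c in top_k:
        # Log of the normalizer over the classes not yet ranked,
        # computed with a max shift for numerical stability.
        shift = max(scores[j] for j in remaining)
        log_norm = shift + math.log(sum(math.exp(scores[j] - shift) for j in remaining))
        log_p += scores[c] - log_norm
        remaining.remove(c)
    return log_p
```

With $K = 1$ the expression reduces to the softmax probability of the top class, and the probabilities of all top-K lists for a fixed $K$ sum to one, which is a useful sanity check on any implementation.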

One-vs-one binary classifiers. In this setting, there are $C(C-1)/2$ scores $f_{1,2}(x), \dots, f_{C-1,C}(x)$, one for each pair of training-domain classes. We relate these scores to the $C(C-1)/2$ parameters $\{v_{ij}\}$ of the Babington-Smith model (13) by $v_{ij} = e^{f_{ij}(x)}$ to get

$$P(\pi|x) \propto \prod_{i=1}^{K} \prod_{j=i+1}^{C} e^{f_{\pi^{-1}(i)\pi^{-1}(j)}(x)}. \tag{15}$$

Similar to (14), this ranker evaluates the probability of any ranking $\pi$ given $x$, taking into account the confidence ($\{f_{ij}(x)\}$) of the $C(C-1)/2$ one-vs-one classifiers. Note that if the pre-trained classifiers are linear, that is, $f_{ij}(x) = w_{ij}^\top x$, then this ranker is quite similar to (14), since $P(\pi|x) \propto \prod_{i=1}^{K} \prod_{j=i+1}^{C} e^{w_{\pi^{-1}(i)\pi^{-1}(j)}^\top x} = \prod_{i=1}^{K} e^{\tilde{w}_{\pi^{-1}(i)}^\top x}$, with $\tilde{w}_{\pi^{-1}(i)}$ defined as $\sum_{j=i+1}^{C} w_{\pi^{-1}(i)\pi^{-1}(j)}$. However, it has a different normalization term from (14).

² We present a slightly modified form.
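The unnormalized log-score of the one-vs-one ranker (15) is a sum of pairwise classifier scores, which the following sketch computes. It assumes that scores are available for every ordered pair with $f_{ji}(x) = -f_{ij}(x)$ (consistent with $v_{ij} = a_{ij}/a_{ji}$, but stated here as an assumption), and that classes are indexed 0 to C-1. Both helper names are hypothetical.

```python
def antisymmetric_scores(upper):
    """Extend scores f_ij given for i < j to all ordered pairs via f_ji = -f_ij."""
    full = dict(upper)
    for (i, j), value in upper.items():
        full[(j, i)] = -value
    return full


def babington_smith_log_score(top_k, pair_scores, num_classes):
    """Unnormalized log P(pi | x) from (15) for an ordered top-K list.

    Each ranked class is compared with every class placed after it:
    the later entries of the top-K list and all unranked classes.
    """
    ranked = list(top_k)
    unranked = [c for c in range(num_classes) if c not in ranked]
    order = ranked + unranked
    total = 0.0
    for pos, i in enumerate(ranked):
        for j in order[pos + 1:]:
            total += pair_scores[(i, j)]
    return total


# Hypothetical one-vs-one scores for three training-domain classes.
f = antisymmetric_scores({(0, 1): 1.0, (0, 2): 2.0, (1, 2): 0.5})
score_a = babington_smith_log_score([0, 1], f, 3)
score_b = babington_smith_log_score([2, 1], f, 3)
```

Because the normalizer in (15) does not depend on the ranking, these unnormalized scores are enough to compare candidate rankings for the same sample $x$.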


Multiclass-loss classifiers. Other types of classifiers can be accommodated. When the pre-trained classifiers are multinomial logistic regression (= softmax) or SVMs with a multiclass loss (Crammer & Singer, 2002), we again have $C$ scores $f_1(x), \dots, f_C(x)$ computed from $C$ parameter vectors $w_1, \dots, w_C$. Similar to the one-vs-rest case, we can use the relation $v_i = e^{f_i(x)}$ with the Plackett-Luce model, which gives us the same ranker as (14). Note that if the original classifier is a multinomial logistic regression, then (14) is in fact a direct generalization of logistic regression for $K = 1$, which is also observed in (Cheng et al., 2010). In this case, the trained parameters $\{w_i\}$ coincide with the maximum likelihood parameters for (14) trained with top-1 rankings, which are the ground-truth labels of the training domain.

To summarize, there exist natural interpretations of the Plackett-Luce and the Babington-Smith models that allow us to relate classification scores to their parameters and use them to produce the posterior probability $P(\pi|x)$ of rankings without any further training³.


2.5. Zero-shot prediction 

The probabilistic rankers $P(\pi|x)$ constructed from pre-trained classifiers and the priors for semantic rankings $P(\pi|z)$ learned from semantic sources are plugged into (7),

$$P(z|x) = \sum_{\pi \in \mathcal{R}} P(\pi|x) P(z|\pi) = \sum_{\pi \in \mathcal{R}} P(\pi|x) \frac{P(z) P(\pi|z)}{\sum_{z'} P(z') P(\pi|z')},$$

and the final prediction of the label $z$ for a test sample $x$ is made by $\arg\max_z P(z|x)$. The sum over (top-K) rankings cannot be computed analytically for either the Plackett-Luce or the Mallows model and requires approximations, e.g., by Markov chain Monte

³ It is not immediately clear whether the Mallows model can be adapted to this setting; this is left for future work.








Probabilistic Zero-shot Classification with Semantic Rankings


Carlo (MCMC) sampling. Alternatively, we use $P(\pi|z) = I[\pi = \pi_0]$ and a uniform prior $P(z)$, somewhat similar to (Rohrbach et al., 2010). In our preliminary experiments, MCMC-based summation showed inferior results to this simple version and is therefore omitted from the report. The final zero-shot classifier is the MAP classifier

$$\arg\max_{z} \prod_{i=1}^{K} \frac{e^{f_{\pi_0^{-1}(i)}(x)}}{\sum_{j=i}^{C} e^{f_{\pi_0^{-1}(j)}(x)}} \tag{16}$$

for pre-trained one-vs-rest/multiclass-loss classifiers, and

$$\arg\max_{z} \prod_{i=1}^{K} \prod_{j=i+1}^{C} e^{f_{\pi_0^{-1}(i)\pi_0^{-1}(j)}(x)} \tag{17}$$

for pre-trained one-vs-one classifiers, where $\pi_0$ is the consensus ranking of the candidate class $z$. We summarize the overall training and testing procedures below.

Training Step 1. Obtain pre-trained classifiers

• Input: training-domain sample and label pairs $\{(x_1, y_1), \dots, (x_N, y_N)\}$, regularization hyperparameter

• Output: score functions $f_1, \dots, f_C$ or $f_{1,2}, \dots, f_{C-1,C}$

• Method: one-vs-rest/one-vs-one/multiclass with any classifier

Training Step 2. Learn priors for semantic rankings

• Input: ranking and test-domain label pairs $\{(\pi_1, z_1), \dots, (\pi_M, z_M)\}$ collected from corpora or crowdsourcing

• Output: consensus rankings $\pi_0$ for each test-domain class $z = 1, \dots, L$ from either the Plackett-Luce model (8) or the Mallows model (11)

• Method: MLE of (9) by BFGS or MLE of (12) by sequential transposition

Testing. Zero-shot classification

• Input: data $x$, parameter $K$ for the top-K list size

• Output: prediction of the test-domain label $z$

• Method: MAP estimation (16) or (17), using the $f(x)$'s from Training Step 1 and the $\pi_0$'s from Training Step 2
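The testing step with the MAP rule (16) can be sketched as follows, for one-vs-rest or multiclass-loss scores. The score values, class indexing, and consensus top-K lists are hypothetical; in practice they would come from Training Steps 1 and 2.

```python
import math


def plackett_luce_log_prob(top_k, scores):
    """Log of the product in (16) for one consensus top-K list."""
    remaining = list(range(len(scores)))
    log_p = 0.0
    for c in top_k:
        shift = max(scores[j] for j in remaining)
        log_norm = shift + math.log(sum(math.exp(scores[j] - shift) for j in remaining))
        log_p += scores[c] - log_norm
        remaining.remove(c)
    return log_p


def predict_zero_shot(scores, consensus):
    """MAP rule (16): pick the test-domain class whose consensus ranking
    is most probable under the Plackett-Luce ranker built from the scores."""
    return max(consensus, key=lambda z: plackett_luce_log_prob(consensus[z], scores))


# Hypothetical training-domain classes: 0=dog, 1=horse, 2=automobile, 3=airplane.
consensus = {"cat": [0, 1], "truck": [2, 3]}  # consensus top-2 lists (K = 2)
animal_like = [2.0, 1.0, -1.0, -2.0]          # scores f_c(x) for one test image
vehicle_like = [-2.0, -1.0, 1.0, 2.0]
```

Calling `predict_zero_shot(animal_like, consensus)` selects cat and `predict_zero_shot(vehicle_like, consensus)` selects truck, without any retraining of the score functions.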


3. Related work 

There are two major approaches to zero-shot learning explored in the literature: attribute-based and similarity-based. In attribute-based knowledge transfer (e.g., (Palatucci et al., 2009; Lampert et al., 2009; Akata et al., 2013)), the classes from the training and test domains are assumed to be distinguishable by a common list of attributes. Attribute-based approaches often show excellent empirical performance (Palatucci et al., 2009; Rohrbach et al., 2010). However, designing attributes that are at the same time discriminative, common to multiple classes, and correlated with the original features can be a non-trivial task that typically requires human supervision. Similar arguments can be found in (Rohrbach et al., 2010) or (Mensink et al., 2014).

By contrast, similarity-based zero-shot approaches use semantic similarity between the training-domain classes $\mathcal{Y}$ and the test-domain classes $\mathcal{Z}$ directly. The advantage of this approach is that similarity information can be mined automatically from the web, linguistic corpora, or other sources. Similarity information has been used to build a probabilistic zero-shot classifier called the direct similarity-based method (DS) (Rohrbach et al., 2010; 2011), which parallels the attribute-based approach from (Lampert et al., 2009). Like ours, the direct similarity-based method uses classification scores and probabilistic inference, but it relies on numerical similarity rather than the non-metric ranking representation of our method. More recently, similarity-based approaches using semantic embedding have been proposed (Frome et al., 2013; Socher et al., 2013). In these algorithms, training- and test-domain objects are simultaneously embedded into a semantic space using multilayer neural networks. While these two methods produce interesting results, they use specific metric similarity models and require retraining when the semantic model changes, unlike our method. Mensink et al. use a linear combination of pre-trained classifiers for classifying unseen data (Mensink et al., 2014). They use co-occurrence statistics as semantic information, whereas we do not assume a specific type of similarity information. Lastly, our method provides a means to aggregate multiple semantic sources, which has not been addressed in the literature.


4. Experiments 

4.1. Datasets 

We use two datasets: 1) the Animals with Attributes dataset (Lampert et al., 2009) and 2) CIFAR-100/10 (Torralba et al., 2008), collected by (Krizhevsky, 2009). Semantic similarity is obtained from WordNet distance, web searches (Rohrbach et al., 2010), word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and Amazon Mechanical Turk. Table 1 summarizes the characteristics of the datasets and the types of semantic information used in the experiments. More details on data processing are provided in the Appendix.

4.2. Methods 

We perform comprehensive tests of the probabilistic ranking-based (PR) zero-shot model under 1) three learning settings (one-vs-rest, one-vs-one, multiclass), 2) two types of semantic sources (linguistic, crowd-sourcing), and 3) different prior models for semantic rankings (the Plackett-Luce and the Mallows models). We compare the probabilistic ranking-based method (PR, Sec. 2.5) to the deterministic ranking-based method (DR, Sec. 2.1) and the direct similarity-based method (DS, (Rohrbach et al., 2010)), which is the closest state-of-the-art method to ours that uses classifier scores. We also refer to other results in the literature for comparison.

Table 1. Datasets used in the experiments.

| | Animals | CIFAR |
|---|---|---|
| Feature dimension | 8941 | 4000 |
| Number of training/validation/test samples | 21847/2448/6180 | 50000/50000/10000 |
| Number of training/test classes | 40/10 | 100/10 |
| Linguistic sources | WordNet, Wikipedia, Yahoo, YahooImage, Flickr | WordNet, word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) |
| Number of surveys from crowd-sourcing | 500 | 500 |

Figure 3. Multidimensional scaling of similarity data for Animals, using Euclidean distance (a), Euclidean distance with normalization (b), and Kendall's distance (c) (K = 2). The 10 test-domain classes are plotted with different symbols and colors.

Regularization parameters for the classifiers are chosen using the validation set, with some manual tuning to avoid exhaustive cross-validation. We test with different hyperparameters K (in the top-K list) and report the results with K = A. For one-vs-rest and one-vs-one, we trained SVMs followed by Platt's probabilistic scaling (Platt, 1999). For multiclass, we used multinomial logistic regression.


4.3. Result 1 - Discriminability of semantic rankings

We first compare the discriminability of classes with ranking vs numerical representations of similarity, without using image data. Using all five linguistic sources for Animals, we compute pairwise distances of 5x10=50 similarity vectors. Two types of distances are computed: the Euclidean distance of numerical similarity, with or without $\ell_1$ normalization, and the Hausdorff distance for top-K lists using Kendall's ranking distance (5). Note that the rankings are obtained by sorting the numerical similarity. For these different representations, the average accuracy of leave-one-out 1-Nearest Neighbor classification was 0.44 (Euclidean), 0.62 (Euclidean with normalization), 0.72 (Kendall's, K=2), 0.70 (Kendall's, K=5), and 0.64 (Kendall's, K=10), which shows that the ranking distances are better than the Euclidean distances for discriminating test-domain classes when there are multiple heterogeneous sources. Figure 3 shows the embeddings from classical Multidimensional Scaling (MDS) using these distances. It shows qualitative differences between numerical similarity (a and b) and ranking (c). The embedding of rankings has better between-class separation and within-class clustering than the embeddings of numerical similarity, suggesting that the non-metric order information is more consistent than numerical similarity across different sources.

We also compute the leave-one-out accuracy of Bayesian classification with the rankings collected directly from crowd-sourcing for the Animals and CIFAR datasets. Out of 500 rankings, one ranking is held out and the 499 remaining rankings are used to build 10 semantic ranking probabilities $P(\pi|z)$ using both the Mallows and the Plackett-Luce models. The class of the held-out ranking is predicted by the maximum of $P(\pi|z)$ over all 10 classes. For Animals, the average accuracy was 0.91/0.99/0.99 (K=2/5/10) using the Mallows model, and 0.79/0.84/0.96 (K=2/5/10) using the Plackett-Luce model. For CIFAR, the average accuracy was 0.73/0.79/0.84 (K=2/5/10) using the Mallows model, and 0.72/0.77/0.74 (K=2/5/10) using the Plackett-Luce model. These numbers show that the rankings obtained from crowd-sourcing carry enough information to discriminate the test-domain classes with accuracies of up to about 0.8 to 1.0.

4.4. Result 2 - Comparison of PR, DR, and DS


We compare the probabilistic ranking (PR), deterministic ranking (DR), and direct similarity (DS) methods in terms of zero-shot classification accuracy. All three methods share the same image features and the same linguistic sources of semantic information (except for the crowd-sourcing for PR), but use them in different ways. PR uses probabilistic models to combine multiple sources of semantic similarity. DR and DS inherently use a single source of semantic similarity, and therefore the multiple sources have to be combined heuristically. We first normalize individual similarity sources to be in the range from 0 to 1, and then compute arithmetic and geometric means over multiple sources. Note that the main difference between DR and DS is that DR uses rankings whereas DS uses numeric values.
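The heuristic aggregation used for DR and DS can be sketched as below. The text only states that each source is normalized to the range from 0 to 1; min-max scaling is an assumption of this sketch, as are the function names and the example values. The sketch also assumes that no source is constant, since min-max scaling would otherwise divide by zero.

```python
import math


def min_max_normalize(values):
    """Rescale one similarity source linearly so that it spans [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def aggregate(sources):
    """Return the arithmetic and geometric means over normalized sources.

    sources: list of similarity vectors, one per semantic source, all over
             the same ordered list of training-domain classes.
    """
    normalized = [min_max_normalize(s) for s in sources]
    n = len(normalized)
    arithmetic = [sum(column) / n for column in zip(*normalized)]
    geometric = [math.prod(column) ** (1.0 / n) for column in zip(*normalized)]
    return arithmetic, geometric


# Two hypothetical sources on very different numerical scales.
arithmetic, geometric = aggregate([[1.0, 2.0, 3.0], [10.0, 30.0, 20.0]])
```

Note that the geometric mean is zero whenever any single source assigns the minimum similarity to a class, so the two means can rank classes differently.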

Table 2. Zero-shot classification accuracy of Direct Similarity (DS), Deterministic Ranking (DR), and Probabilistic Ranking (PR).
Each method is tested with different semantic sources and classifier types. Indiv: average accuracy using individual semantic similarities,
Arithm: accuracy using arithmetic mean of similarities, Geom: accuracy using geometric mean of similarities. The best result for each
method on each dataset is marked with an asterisk (*).

Animals dataset

             Direct Similarity (DS)    Deterministic Ranking (DR)  Probabilistic Ranking (PR)
             Linguistic sources        Linguistic sources          Linguistic sources  Crowd-source
             Indiv   Arithm  Geom      Indiv   Arithm  Geom        P.-L    Mallows     P.-L    Mallows
one-vs-rest  0.320   0.334   0.354*    0.329   0.330   0.347       0.320   0.312       0.351   0.351
one-vs-one   n/a                       0.341   0.343   0.359*      0.358   0.320       0.374   0.374
multiclass   n/a                       0.331   0.345   0.355       0.370   0.345       0.395*  0.392

CIFAR dataset

             Direct Similarity (DS)    Deterministic Ranking (DR)  Probabilistic Ranking (PR)
             Linguistic sources        Linguistic sources          Linguistic sources  Crowd-source
             Indiv   Arithm  Geom      Indiv   Arithm  Geom        P.-L    Mallows     P.-L    Mallows
one-vs-rest  0.273   0.300   0.316*    0.224   0.258   0.260       0.314   0.288       0.258   0.282
one-vs-one   n/a                       0.244   0.281*  0.278       0.335   0.297       0.244   0.261
multiclass   n/a                       0.251   0.272   0.276       0.339*  0.320       0.260   0.292

DS vs DR. The results are shown in Table 2. For both
DR and DS, using averaged semantic similarity (“Arithm”
and “Geom”) is better than using individual similarity
(“Indiv”), for both Animals and CIFAR datasets. A plausible
interpretation is that the aggregate similarity is more
reliable than individual similarities despite using heuristic
methods of aggregation. The highest accuracy from DS is
0.354 (for Animals) and 0.316 (for CIFAR), whereas the
highest accuracy from DR is 0.359 (for Animals) and 0.281
(for CIFAR). DR performs slightly better than DS on Animals,
but worse on CIFAR. Within DR, accuracy is not affected
much by the pre-trained classifier type (one-vs-rest,
one-vs-one, multiclass).

PR vs others. Using the same linguistic sources, the highest
accuracy from PR is 0.370 (for Animals) and 0.339
(for CIFAR), which are much higher than those of DS and DR
regardless of whether a single (Indiv) or multiple (Arithm
and Geom) sources are used. This suggests the advantage
of using probabilistic models to aggregate multiple semantic
sources. Within PR results, one-vs-rest and one-vs-one
classifiers perform comparably, and multiclass logistic
regression performs the best. PR performs even better with
crowd-sourced semantic information (0.395) than with linguistic
sources (0.370) on Animals, but the opposite is true
on CIFAR, probably because human-subject ratings are less
reliable for CIFAR (correctly sorting 100 categories
compared to 40 in Animals).

In the literature, the accuracy of attribute-based methods
with Animals ranges from 0.36 to 0.44 (Tables 3 and 4,
(Akata et al., 2013)), compared to 0.395 from our method
which does not use attributes. We remind the reader that finding
‘good’ attributes is itself a non-trivial task. When both
similarity and attributes are mined automatically from corpora,
similarity-based methods perform much better than
attribute-based methods (individual average of 0.22 from
Table 1, (Rohrbach et al., 2010)).

Figure 4. Zero-shot classification accuracy for CIFAR.
Class pairs shown: cat-dog, plane-auto, auto-deer, deer-ship, cat-truck.

Lastly, Figure 4 shows two-class classification accuracy of
PR (PL+linguistic sources), DS, and an embedding-based
method on select pairs of classes from CIFAR (Figure 3,
(Socher et al., 2013)). Although the numbers may not be
directly comparable due to different settings^4, PR performs
noticeably better than the two state-of-the-art methods. In fact, we
can distinguish auto vs deer, deer vs ship, or cat vs truck
with ~95% accuracy, without a single training image of
these categories.

^4 Socher et al. used the rest of classes from CIFAR-10 instead
of CIFAR-100 for training, and also used different semantic
information.

5. Conclusion

In this paper, we propose a ranking-based representation of
semantic similarity, as an alternative to metric representations
of similarity. Using rankings, semantic information
from multiple sources can be aggregated naturally to produce
a better representation than individual sources. Using
this representation and probability models of rankings, we
present new zero-shot classifiers which can be constructed
from pretrained classifiers without retraining, and demonstrate
their potential for exploiting semantic structures of
real-world visual objects.

A. Datasets

The Animals with Attributes dataset (Animals) was collected
and processed by (Lampert et al., 2009). The training
domain consists of images of 40 types of animals,
from which 21,847 and 2,448 images were used as training
and validation sets. From each image, 10,942-dimensional
features are extracted (Lampert et al., 2009). The
test domain consists of 6,180 images of 10 types of animals
which do not overlap with the training-domain
classes. Semantic similarities of the animals are provided by
(Rohrbach et al., 2010)^5, computed from five
different linguistic sources: path distance from WordNet,
co-occurrence from Wikipedia, Yahoo web search, Yahoo
image search, and Flickr image search.

The CIFAR-100 and CIFAR-10 datasets were collected by
(Krizhevsky, 2009). The training domain (CIFAR-100)
consists of 60,000 images of 100 types of objects
including animals, plants, household objects, and scenery.
We use 50,000 and 10,000 images from CIFAR-100 as
training and validation sets. The test domain (CIFAR-10)
consists of 60,000 images of 10 types of objects similar
to CIFAR-100, without any overlap with the classes from
CIFAR-100. We use 10,000 images as test data. To
compute features, we use a deep neural network^6
trained on the ImageNet ILSVRC2010 dataset^7,
consisting of 1.2 million images of 1000 categories. We
apply CIFAR-100 and CIFAR-10 training images to the
network, and use the 4096-dimensional output from the
last hidden layer of the network as features. For semantic
similarity of CIFAR-100 and CIFAR-10, we compute
the WordNet path distance, and also use the word2vec
tools (Mikolov et al., 2013)^8 and GloVe tools (Pennington
et al., 2014)^9.

In addition to using linguistic sources, we use Amazon
Mechanical Turk to collect word similarity data by crowd-sourcing.
Each participant in the survey is shown a word
from the test-domain classes, and is asked to sort 10 words
from the training domain according to their perceived similarity
to the given word. The initial order of the 10 words is
randomized for each survey. We pre-select the 10 closest
words for each test-domain word, because we found from
preliminary trials that ordering all words (40 for Animals
and 100 for CIFAR) is too demanding and time-consuming
for participants. For Animals, the 10 closest words are selected
based on the average ranking of the words w.r.t. the test-domain
word from the five linguistic sources. For CIFAR,
we use the path distance from WordNet. Fifty surveys are
collected for each test-domain class.

^5 http://www.d2.mpi-inf.mpg.de/nlp4vision
^6 https://github.com/jetpacapp/DeepBeliefSDK
^7 http://www.image-net.org/challenges/LSVRC
^8 https://code.google.com/p/word2vec/
^9 http://nlp.stanford.edu/projects/glove/
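
The pre-selection step for Animals (choosing the 10 closest training-domain words by their average ranking over the linguistic sources) can be sketched as follows. This is a minimal illustration under assumptions: higher scores are taken to mean higher similarity, ties are broken by index order, and the function names (rank_positions, closest_by_average_rank) are hypothetical.

```python
def rank_positions(scores):
    """Map each candidate index to its rank (0 = most similar) in one source."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return {item: pos for pos, item in enumerate(order)}


def closest_by_average_rank(sources, n_select=10):
    """Pick the n_select candidates with the best average rank over all sources.

    sources: one list of similarity scores per source, all scoring the same
    training-domain words with respect to a single test-domain word.
    """
    ranks = [rank_positions(s) for s in sources]
    n_items = len(sources[0])
    avg_rank = [sum(r[i] for r in ranks) / len(ranks) for i in range(n_items)]
    return sorted(range(n_items), key=lambda i: avg_rank[i])[:n_select]


# Toy example: two sources scoring four candidate words; keep the best two.
toy_sources = [[0.9, 0.1, 0.5, 0.3], [0.8, 0.2, 0.1, 0.7]]
selected = closest_by_average_rank(toy_sources, n_select=2)
```

Averaging rank positions rather than raw scores avoids having to put the sources on a common numeric scale before the selection.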

B. Implementation 

The Direct Similarity-based method (DS) (Rohrbach et al.,
2010) is implemented as follows. The probability P(y_k|x)
is modeled by a one-vs-rest binary SVM classifier followed
by Platt’s probabilistic scaling (Platt, 1999), trained
with training-domain feature and label pairs. In testing, the
probability P(y_k|x) is evaluated for a test image, and the
prediction of the test-domain class is made by MAP estimation
using

« n . yi = (>*) 

where Wy^ is the similarity score. The sum above is limited
to the five most similar training-domain classes. We have
tested different values of the prior P(y_k), which did not
have visible effects on the result.





